66 research outputs found
On the usage of the probability integral transform to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems
We present a new distributed fuzzy partitioning method to reduce the
complexity of multi-way fuzzy decision trees in Big Data classification
problems. The proposed algorithm builds a fixed number of fuzzy sets for all
variables and adjusts their shape and position to the real distribution of
training data. A two-step process is applied: 1) transformation of the
original distribution into a standard uniform distribution by means of the
probability integral transform. Since the original distribution is generally
unknown, the cumulative distribution function is approximated by computing the
q-quantiles of the training set; 2) construction of a Ruspini strong fuzzy
partition in the transformed attribute space using a fixed number of equally
distributed triangular membership functions. Despite the aforementioned
transformation, the definition of every fuzzy set in the original space can be
recovered by applying the inverse cumulative distribution function (also known
as quantile function). The experimental results reveal that the proposed
methodology allows the state-of-the-art multi-way fuzzy decision tree (FMDT)
induction algorithm to maintain classification accuracy with up to 6 million
fewer leaves.
Comment: Appeared in 2018 IEEE International Congress on Big Data (BigData
Congress). arXiv admin note: text overlap with arXiv:1902.0935
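The two-step process in the abstract above can be sketched roughly as follows. This is a minimal single-machine illustration, not the authors' distributed algorithm: the function names, the piecewise-linear interpolation between quantiles, and the toy data are all assumptions made for the example.

```python
from bisect import bisect_right


def quantile_cut_points(values, q):
    """Return q + 1 cut points (minimum, inner q-quantiles, maximum)."""
    s = sorted(values)
    n = len(s)
    return [s[round(k * (n - 1) / q)] for k in range(q + 1)]


def approx_cdf(x, cuts):
    """Step 1: approximate the CDF by linear interpolation between cut points."""
    q = len(cuts) - 1
    if x <= cuts[0]:
        return 0.0
    if x >= cuts[-1]:
        return 1.0
    k = bisect_right(cuts, x) - 1  # interval with cuts[k] <= x < cuts[k + 1]
    lo, hi = cuts[k], cuts[k + 1]
    return (k + (x - lo) / (hi - lo)) / q


def approx_quantile(u, cuts):
    """Inverse of approx_cdf: map u in [0, 1] back to the original space."""
    q = len(cuts) - 1
    pos = u * q
    k = min(int(pos), q - 1)
    return cuts[k] + (pos - k) * (cuts[k + 1] - cuts[k])


def triangular_memberships(u, n_sets):
    """Step 2: degrees of u in n_sets equally spaced triangular fuzzy sets.

    For u in [0, 1] the degrees sum to 1 (a Ruspini strong partition).
    """
    step = 1.0 / (n_sets - 1)
    return [max(0.0, 1.0 - abs(u - i * step) / step) for i in range(n_sets)]


# Toy skewed attribute (assumed data): squares of 0..100.
data = [x * x for x in range(101)]
cuts = quantile_cut_points(data, q=4)
u = approx_cdf(2500, cuts)               # transformed value in [0, 1]
degrees = triangular_memberships(u, 3)   # memberships in 3 fuzzy sets
# Cores of the 3 fuzzy sets, recovered in the original attribute space.
cores = [approx_quantile(i / 2, cuts) for i in range(3)]
```

Because the partition is built in the transformed space, the fuzzy sets are equally spaced there but follow the data density once mapped back through the approximate quantile function.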
Metrics for Dataset Demographic Bias: A Case Study on Facial Expression Recognition
Demographic biases in source datasets have been shown to be one of the causes of
unfairness and discrimination in the predictions of Machine Learning models.
One of the most prominent types of demographic bias is statistical imbalance
in the representation of demographic groups in the datasets. In this paper, we
study the measurement of these biases by reviewing the existing metrics,
including those that can be borrowed from other disciplines. We develop a
taxonomy for the classification of these metrics, providing a practical guide
for the selection of appropriate metrics. To illustrate the utility of our
framework, and to further understand the practical characteristics of the
metrics, we conduct a case study of 20 datasets used in Facial Emotion
Recognition (FER), analyzing the biases present in them. Our experimental
results show that many metrics are redundant and that a reduced subset of
metrics may be sufficient to measure the amount of demographic bias. The paper
provides valuable insights for researchers in AI and related fields to mitigate
dataset bias and improve the fairness and accuracy of AI models. The code is
available at https://github.com/irisdominguez/dataset_bias_metrics.
Comment: 18 pages, 8 figures. Appendix included, 21 additional pages, 20
additional figures
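To make the notion of representational imbalance concrete, the sketch below computes three generic measures over a list of demographic group labels: the imbalance ratio, the normalized Shannon evenness, and the effective number of groups (the last two are borrowed from ecology). These are standard measures chosen for illustration; whether they coincide with the metrics reviewed in the paper is an assumption, and the function names are hypothetical rather than taken from the linked repository.

```python
import math
from collections import Counter


def group_proportions(labels):
    """Fraction of samples that belongs to each demographic group."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def imbalance_ratio(labels):
    """Size of the largest group divided by the size of the smallest."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


def shannon_entropy(labels):
    """Shannon entropy (natural log) of the group distribution."""
    return -sum(p * math.log(p) for p in group_proportions(labels).values())


def shannon_evenness(labels):
    """Entropy normalized to [0, 1]; 1 means perfectly balanced groups."""
    n_groups = len(set(labels))
    if n_groups < 2:
        return 1.0
    return shannon_entropy(labels) / math.log(n_groups)


def effective_number_of_groups(labels):
    """Number of equally sized groups with the same entropy."""
    return math.exp(shannon_entropy(labels))


# Assumed toy annotations, not data from the case study.
balanced = ["a"] * 50 + ["b"] * 50
skewed = ["a"] * 90 + ["b"] * 10
```

On the balanced toy labels the evenness is 1 and the effective number of groups is 2; on the skewed labels both drop, while the imbalance ratio rises to 9.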
Generative Adversarial Networks for Bitcoin Data Augmentation
In Bitcoin entity classification, results are strongly conditioned by the
ground-truth dataset, especially when applying supervised machine learning
approaches. However, these ground-truth datasets are frequently affected by
significant class imbalance as generally they contain much more information
regarding legal services (Exchange, Gambling), than regarding services that may
be related to illicit activities (Mixer, Service). Class imbalance increases
the complexity of applying machine learning techniques and reduces the quality
of classification results, especially for underrepresented, but critical
classes.
In this paper, we propose to address this problem by using Generative
Adversarial Networks (GANs) for Bitcoin data augmentation as GANs recently have
shown promising results in the domain of image classification. However, there
is no "one-fits-all" GAN solution that works for every scenario. In fact,
setting GAN training parameters is non-trivial and heavily affects the quality
of the generated synthetic data. We therefore evaluate how GAN parameters such
as the optimization function, the size of the dataset and the chosen batch size
affect GAN implementation for one underrepresented entity class (Mining Pool)
and demonstrate how a "good" GAN configuration can be obtained that achieves
high similarity between synthetically generated and real Bitcoin address data.
To the best of our knowledge, this is the first study presenting GANs as a
valid tool for generating synthetic address data for data augmentation in
Bitcoin entity classification.
Comment: 8 pages, 5 figures, 4 tables
Multi-class strategies for joint building footprint and road detection in remote sensing
Building footprints and road networks are important inputs for a great deal of services. For instance, building maps are useful for urban planning, whereas road maps are essential for disaster response services. Traditionally, building and road maps are manually generated by remote sensing experts or land surveying, occasionally assisted by semi-automatic tools. In the last decade, deep learning-based approaches have demonstrated their capabilities to extract these elements automatically and accurately from remote sensing imagery. The building footprint and road network detection problem can be considered a multi-class semantic segmentation task, that is, a single model performs a pixel-wise classification on multiple classes, optimizing the overall performance. However, depending on the spatial resolution of the imagery used, both classes may coexist within the same pixel, drastically reducing their separability. In this regard, binary decomposition techniques, which have been widely studied in the machine learning literature, have proved useful for addressing multi-class problems. Accordingly, the multi-class problem can be split into multiple binary semantic segmentation sub-problems, specializing different models for each class. Nevertheless, in these cases, an aggregation step is required to obtain the final output labels. Additionally, other novel approaches, such as multi-task learning, may come in handy to further increase the performance of the binary semantic segmentation models. Since there is no certainty as to which strategy should be carried out to accurately tackle a multi-class remote sensing semantic segmentation problem, this paper performs an in-depth study to shed light on the issue. For this purpose, open-access Sentinel-1 and Sentinel-2 imagery (at 10 m) are considered for extracting buildings and roads, making use of the well-known U-Net convolutional neural network.
It is worth stressing that building and road classes may coexist within the same pixel when working at such a low spatial resolution, setting a challenging problem scheme. Accordingly, a robust experimental study is developed to assess the benefits of the decomposition strategies and their combination with a multi-task learning scheme. The obtained results demonstrate that decomposing the considered multi-class remote sensing semantic segmentation problem into multiple binary ones using a One-vs-All binary decomposition technique leads to better results than the standard direct multi-class approach. Additionally, the benefits of using a multi-task learning scheme for pushing the performance of binary segmentation models are also shown.
Christian Ayala was partially supported by the Government of Navarra under the industrial PhD program 2020 reference 0011-1408-2020-000008. Mikel Galar was partially supported by Tracasa Instrumental S.L. under projects OTRI 2018-901-073, OTRI 2019-901-091 and OTRI 2020-901-050, and by the Spanish MICIN (PID2019-108392GB-I00 / AEI / 10.13039/501100011033)
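The aggregation step that a One-vs-All decomposition requires can be sketched as below. Each binary model is assumed to output a per-pixel probability map for its class; the rule used here (take the highest-scoring class, fall back to background when no score reaches a threshold) is one simple possibility chosen for illustration, not necessarily the aggregation used in the paper. The function name, the threshold, and the toy maps are hypothetical.

```python
def aggregate_one_vs_all(prob_maps, threshold=0.5, background="background"):
    """Combine per-class binary probability maps into one label map.

    prob_maps maps a class name to a 2D list of probabilities; all maps
    must share the same shape. Each pixel gets the class with the highest
    probability, or `background` if that probability is below `threshold`.
    """
    names = list(prob_maps)
    first = prob_maps[names[0]]
    rows, cols = len(first), len(first[0])
    labels = []
    for r in range(rows):
        row = []
        for c in range(cols):
            best = max(names, key=lambda name: prob_maps[name][r][c])
            confident = prob_maps[best][r][c] >= threshold
            row.append(best if confident else background)
        labels.append(row)
    return labels


# Assumed outputs of two binary segmentation models on a 2 x 2 tile.
building = [[0.9, 0.2], [0.6, 0.1]]
road = [[0.3, 0.8], [0.7, 0.4]]
label_map = aggregate_one_vs_all({"building": building, "road": road})
```

In the bottom-left pixel both models fire (0.6 and 0.7), which mirrors the case the abstract highlights where buildings and roads coexist in one pixel; this rule resolves it in favor of the higher score.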